Method and system for text compression and decompression

ABSTRACT

The present method of text compression and decompression is based on creating and recovering a pseudo-code (Y). The pseudo-code (Y) is created by the formula Y=C+X and carries the information of a repeating index/symbol (constant C) and a current index/symbol (X). The pseudo-code (Y) is converted back into the original information by the formula X=Y−C. To support the pseudo-code, the original symbols of the text are converted into indexes, and a permanent vocabulary and a temporary vocabulary are created. The permanent vocabulary is a redundant vocabulary built in advance; it includes a dictionary of common symbols taken from books, articles, and dictionaries, and serves as a reference vocabulary stored in permanent memory. The temporary vocabulary is built and used during the compression and decompression processes. Its function is to convert the high bit length indexes of the permanent vocabulary into the low bit length indexes of the temporary vocabulary.

REFERENCES CITED

-   [1] D. Huffman, “A Method for the Construction of Minimum Redundancy Codes,” in Proc. IRE, vol. 40, no. 9, pp. 1098-1101, 1952.
-   [2] Gonzalo Navarro and Mathieu Raffinot, “A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text,” Proc. CPM'99, LNCS 1645, pp. 14-36, 1999.
-   [3] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977.
-   [4] J. Ziv and A. Lempel, “Compression of individual sequences via variable length coding,” IEEE Trans. Inform. Theory, vol. 24, pp. 530-536, 1978.
-   [5] U. Khurana “Text compression and Superfast Searching;
-   [6] Generation Text Retrieval Systems”, IEEE Computer 33(11):37-44 (cover feature), November 2000.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to the field of compression and decompression of text.

2. Description of the Prior Art

Compression algorithms such as Huffman coding, LZ78, LZW, and the hundreds of variants of these techniques usually exploit the statistical redundancy of English letters and give a limited compression rate, a limit that was formulated by Claude E. Shannon. According to Shannon's theory of data compression, there is a fundamental limit (the entropy rate) to lossless data compression. From Shannon's lossless source coding theorem, the fundamental limits for the first-, second-, and third-order statistical distributions of English text (an alphabet of 26 letters plus space) are 4.07, 3.36, and 2.77 bits/character, respectively. Using a prediction method, Shannon estimated for the general model that the entropy rate of English text can in theory reach 2.3 bits/character. None of the compression algorithms listed above achieves the entropy rate of Shannon's third-order statistical distribution of English text, 2.77 bits/character. Many techniques have been proposed to increase the compression rate (an absolute number) or the compression ratio (a relative number), e.g. word-based Huffman coding, in which the table of symbols in the compression coder becomes the text vocabulary; Efficient Optimal Recompression; Semi-lossless Text Compression; programmed selection of common characters and pairs; programmed selection of prefixes and suffixes; U. Khurana, “Text compression and Superfast Searching;” the techniques explained in patents 6,047,298 and 6,883,087; D. Huffman, “A Method for the Construction of Minimum Redundancy Codes,” in Proc. IRE; J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”; and J. Ziv and A. Lempel, “Compression of individual sequences via variable length coding,” IEEE Trans. Inform. Theory, 24:530-536, 1978.

Compression techniques involve trade-offs between various factors, such as the complexity of the design of the data compression/decompression scheme, the ability to search a compressed text without decompressing it, the speed of the operating system, the consumption of expensive resources (i.e. storage and transmission bandwidth), the compression capability, the time it takes to compress information, the computing power of the user's device, and the cost of coding and decoding the text. None of the current methods satisfies the requirement of efficient compression and decompression of text. Furthermore, the current methods have both advantages and disadvantages depending on the application, e.g. where the requirement is to reduce the text decompression time and the working frequency of the microprocessor of an electronic reader.

The present invention seeks to overcome some of the restrictions of the systems and apparatuses involved in coding/decoding, storing, and transmitting text. Furthermore, the present method of converting any symbols into indexes makes it possible to increase the compression ratio of stored text, to increase the compression rate of transmitted text, and to reduce the cost of receivers.

In the present invention, “symbol” means a letter, word, phrase, number, sentence, punctuation mark, prefix, suffix, or a permanently or temporarily made combination of words. “Index” means the address of a symbol located in the permanent or temporary vocabulary.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a compression and decompression method that converts the symbols of a text into indexes to produce compressed text and then recovers these indexes back into symbols of text as needed.

Another object of the present invention is to provide the permanent vocabulary. The permanent vocabulary is a redundant vocabulary; it includes a dictionary of common symbols taken from thousands of books and serves as a permanent reference vocabulary. The permanent vocabulary is created in advance of any information processing.

Still another object of the present invention is to provide the temporary vocabulary. The temporary vocabulary includes repeating and common indexes/symbols (constant C) and current indexes/symbols (X). These repeating, common, and current indexes are kept in two separate parts of the temporary vocabulary, the root of tree storage and the main storage. The function of the temporary vocabulary is to convert high bit length indexes belonging to the permanent vocabulary into low bit length indexes in the temporary vocabulary, which are then used to create pseudo-codes.

Still another object of the present invention is to provide an analyzer, which implements the method of creating the pseudo-code (Y) by the formula Y=C+X and recovering it by the formula X=Y−C, where constant C is the repeating index/symbol and X is the current index/symbol.

The features and advantages of the present method and of the system based on it will be apparent from the following description and from the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of the present method of text compression, storage, transmission, and decompression.

FIG. 2 illustrates some repeating symbols.

FIG. 3 illustrates a flow chart of text compression and decompression.

FIG. 4 illustrates a flow chart of text compression on a transmitter side and text decompression on a receiver side.

FIG. 5 illustrates the process of loading the temporary vocabulary.

DESCRIPTION OF THE PREFERRED METHOD AND SYSTEM

The present invention provides a method and apparatus that convert symbols into indexes, create the pseudo-code (Y) by the formula Y=C+X, and recover the pseudo-code into symbols by the formula X=Y−C, where the pseudo-code includes the information of the repeating and common (constant C) and current (X) indexes/symbols.
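As a minimal sketch of the two formulas, the following Python fragment builds a pseudo-code and recovers both indexes from it. The description does not state how the receiver identifies C from Y alone, so the sketch assumes, purely for illustration, that every constant C is a multiple of a hypothetical STRIDE and that every current index X is smaller than STRIDE. The names make_pseudo_code, recover_indexes, and STRIDE are not taken from the source.

```python
# Assumption (not in the source): C values are multiples of STRIDE and
# 0 <= X < STRIDE, so that Y alone determines both C and X.
STRIDE = 16


def make_pseudo_code(c, x):
    """Create the pseudo-code by the formula Y = C + X."""
    if c % STRIDE != 0:
        raise ValueError("assumed layout: C must be a multiple of STRIDE")
    if not 0 <= x < STRIDE:
        raise ValueError("assumed layout: X must be smaller than STRIDE")
    return c + x


def recover_indexes(y):
    """Recover (C, X) from Y; X follows from the formula X = Y - C."""
    c = (y // STRIDE) * STRIDE  # the assumed layout makes C recoverable
    x = y - c
    return c, x


y = make_pseudo_code(32, 5)
c, x = recover_indexes(y)
```

Other ways of telling the receiver which C applies (for example, sending it once per group of pseudo-codes) would fit the formulas equally well; the stride layout is only one possibility.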

The present method of text compression requires building the permanent vocabulary, which then serves as a reference vocabulary used in different applications. The permanent vocabulary is a redundant vocabulary; it includes a dictionary of common symbols taken from thousands of books and articles and serves as a permanent reference vocabulary. This permanent vocabulary is created in advance of any information processing.

The present method of text compression and decompression also requires building the temporary vocabulary during information processing. The temporary vocabulary includes repeating and common indexes/symbols (constant C) and current indexes/symbols (X). These repeating, common, and current indexes are kept in two separate parts of the temporary vocabulary, the root of tree storage and the main storage. The function of the temporary vocabulary is to convert high bit length indexes belonging to the permanent vocabulary into low bit length indexes in the temporary vocabulary, which are then used to create pseudo-codes.

The present method of text compression and decompression is illustrated in FIG. 1, where Y is the pseudo-code; constant C is the repeating index (repeating indexes are held in the root of tree storage of the temporary vocabulary); X is the current index (current indexes are held in the main storage of the temporary vocabulary); and the sentence is taken from patent 6,227,354.

In this embodiment the original symbols of a sentence are converted into repeating and current indexes (C1-C6, C8, and X1-X8), and these indexes are then converted into pseudo-codes (Y1-Y8) by the formula Y=C+X. The organized pseudo-codes are then transmitted, and on the receiver side they are recovered into indexes/symbols, as constant C and current X, by the formula X=Y−C. Depending on the application, the indexes in the root of tree and main storages of the temporary vocabulary are stored with or without their corresponding symbols. All computed data for this example are shown in FIG. 1. In this example the largest pseudo-code value, Y8, is about 70, which means that each transmitted pseudo-code needs at least seven bits. The compression ratio for this text is calculated as follows. Assume that building the root of tree storage of the temporary vocabulary costs about 42 bits = 7 repeating symbols * 6 bits per symbol; that building the main storage of the temporary vocabulary costs about 104 bits = (12 symbols * 7 bits per loaded symbol) + (5 hybrid symbols * 4 bits per symbol); and that transmitting the pseudo-codes costs about 56 bits = 8 pseudo-codes * 7 bits per transmitted pseudo-code. The uncompressed text occupies 1096 bits = 137 characters (including spaces and punctuation marks) * 8 bits per character. Under these assumptions the compression ratio is about 81.5% = [(1096 − (42+104+56))/1096] * 100. According to the present method of transmitting indexes, the transmission rate and the compression ratio are further increased because indexes cover not only symbols but also punctuation marks and spaces. (For the current method of word recognition, the spaces must be present.)
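The arithmetic above can be checked with a short Python sketch. The function names are illustrative. The helper repeated_ratio applies the same accounting to the 16 bit and 22 bit example in the paragraph that follows, under the assumption (not spelled out in the text, but consistent with its figures) that the 22 loading bits are paid once and each repetition costs one pseudo-code.

```python
def compression_ratio(uncompressed_bits, compressed_bits):
    """Percentage of the original size saved by compression."""
    return (uncompressed_bits - compressed_bits) / uncompressed_bits * 100


# Bit counts taken from the FIG. 1 example.
root_bits = 7 * 6              # 7 repeating symbols, 6 bits each
main_bits = 12 * 7 + 5 * 4     # 12 loaded symbols + 5 hybrid symbols
code_bits = 8 * 7              # 8 pseudo-codes, 7 bits each
original_bits = 137 * 8        # 137 characters, 8 bits each

fig1_ratio = compression_ratio(original_bits,
                               root_bits + main_bits + code_bits)


def repeated_ratio(code_len, repeats, chars=15, load_bits=22):
    """Ratio when one pseudo-code covering `chars` characters repeats.

    Assumed accounting: load_bits are spent once, code_len per repeat.
    """
    original = repeats * chars * 8
    compressed = repeats * code_len + load_bits
    return compression_ratio(original, compressed)


ratios = {(n, r): round(repeated_ratio(n, r), 1)
          for n in (16, 22) for r in (1, 5)}
```

With these inputs fig1_ratio evaluates to roughly 81.6, in line with the "about 81.5%" quoted above.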

In another example, assume that the pseudo-code has a length of 16 or 22 bits, that it covers 15 characters, that it repeats 1 or 5 times, and that loading the pseudo-code's two indexes into the temporary vocabulary takes about 22 bits. For a 16 bit pseudo-code with 1 and 5 repetitions, the compression ratio is about 68.3% and 83%, respectively. For a 22 bit pseudo-code with 1 and 5 repetitions, the compression ratio is about 63.3% and 78%, respectively. These examples show that the compression ratio depends on the number of bits needed to build the temporary vocabulary, the pseudo-code length, the number of characters covered by the pseudo-code, and the number of times a pseudo-code repeats.

In the present invention the compression ratio is increased by a proposed apparatus, the analyzer. In the analyzer, a program analyzes the indexes of the source text and decides on the optimum number of pseudo-codes to transmit, on the use of symbols from the root of tree and main storages of the temporary vocabulary, on the length of the transmitted pseudo-codes, and on the variable length of the indexes loaded into the temporary vocabulary. Furthermore, the analyzer uses a spider program, which reduces the number of indexes loaded into the temporary vocabulary and the number of pseudo-codes by storing and transmitting code (an algorithm) instead of indexes. For example, the spider program allows symbols such as ‘a’, ‘the’, ‘on the’, ‘in the’, etc. to be combined with the word that follows, for example ‘street’, to form hybrid symbols such as ‘a street’ and ‘the street’, and puts them into a memory location of the root of tree or main storage.

In another example, the spider program builds the temporary vocabulary by combining verbs such as “said”, “asked”, “answered”, “continued”, “replied”, and “cried” with nouns from the text and putting the combined symbols into a new storage location of the root of tree and/or main storage. In still another example, the spider program increases the number of indexes in the temporary vocabulary by using the command “add”, which adds punctuation marks such as “,”, “!”, “.”, “?”, and others to the group of indexes belonging to the main storage. The spider program thus adds some redundancy to the quantity of indexes in the temporary vocabulary.
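The hybrid-symbol step of the spider program might look like the following Python sketch. It is an illustration under stated assumptions rather than the patented procedure: the text is assumed to be already split into lowercase word tokens, the prefix list contains only the examples named above, and the function name merge_hybrids is hypothetical.

```python
# Prefix symbols named in the description; any real list would be larger.
PREFIXES = ("a", "the", "on the", "in the")


def merge_hybrids(tokens, prefixes=PREFIXES):
    """Combine a prefix symbol with the word that follows it."""
    # Try longer prefixes first so that "on the" wins over "the".
    ordered = sorted((p.split() for p in prefixes), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for parts in ordered:
            n = len(parts)
            # A hybrid needs the whole prefix plus one following word.
            if tokens[i:i + n] == parts and i + n < len(tokens):
                out.append(" ".join(tokens[i:i + n + 1]))
                i += n + 1
                break
        else:
            out.append(tokens[i])  # no prefix starts here; keep the word
            i += 1
    return out


hybrids = merge_hybrids("he walked on the street and saw a dog".split())
```

Because both sides can run the same rule, the hybrid symbols can be rebuilt from the rule itself instead of being loaded index by index, which is the saving the description attributes to the spider program.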

FIG. 2 illustrates some common symbols (constant C). In this example only the characters of the repeating and common symbols are shown.

According to the present method of text compression/decompression, both the transmitter side and the receiver side have to store a reference permanent vocabulary. The permanent vocabulary is a statistically and functionally selected set of symbols taken from thousands of books, articles, and dictionaries. The permanent vocabulary may, for example, include several sections: section 1, symbols and the most commonly used words; section 2, nouns; section 3, verbs; section 4, adjectives; section 5, numbers; section 6, names; section 7, words represented as a sum of groups of characters; and section 8, languages. Section 7 is used when a word is not included in the permanent vocabulary. The index length of such a word is the sum of the lengths of the section 7 indexes of its parts. For example, the word “stuttering”, which is not present in the permanent vocabulary, is separated into the parts “stut+ter+ing”, and the section 7 indexes of these parts together represent the word “stuttering”. Section 7 includes both bare symbols and symbols with spaces. The symbols with spaces make it possible to recognize words without special logic or an additional command. Section 3 includes all dictionary verbs and symbols such as “have saved”, “has saved”, “not saved”, “have been saved”, and “being saved”.
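The section 7 fallback can be sketched in Python as a greedy longest-match split. The fragment table, its index values, and the function names are hypothetical; the description does not say how parts are chosen, and a greedy split can fail on words where a backtracking search would succeed.

```python
# Hypothetical section 7 entries (character group -> index).
SECTION_7 = {"stut": 3, "ter": 41, "ing": 7, "er": 12}


def split_word(word, fragments=SECTION_7):
    """Split a word into section 7 parts, longest match first."""
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in fragments:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no fragment matches {word[pos:]!r}")
    return parts


def encode_word(word, fragments=SECTION_7):
    """Represent an out-of-vocabulary word by the indexes of its parts."""
    return [fragments[p] for p in split_word(word, fragments)]


parts = split_word("stuttering")
indexes = encode_word("stuttering")
```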

It is understood that examples of the permanent vocabulary based on what is described herein may be implemented in a variety of different applications, e.g. an electronic book reader, a book on CD, the Internet (where a permanent vocabulary is used with symbols taken from thousands of books, articles, and professional dictionaries), or readers for foreigners (where the permanent vocabulary includes not only a translator but also a definition dictionary).

FIG. 3 illustrates a flow chart of text compression and decompression.

Where the reference numerals denote: 1, source text in the form of characters; 2, permanent vocabulary; 3, source text in the form of indexes; 4, analyzer; 5, pseudo-codes; 6, compressed temporary vocabulary; 7, storage of compressed target text; 8, recovered temporary vocabulary; 9, uncompressed target text in the form of characters.

In this embodiment the process of text compression includes the following steps:

1. Converting the source text in the form of characters 1 into corresponding indexes 3 through the permanent vocabulary 2;
2. Counting repeating, non-repeating, and common indexes;
3. Temporarily storing repeating, non-repeating, and common indexes;
4. Compressing the high bit corresponding indexes belonging to the temporary vocabulary (see FIG. 5) and then organizing these compressed indexes 6 for storing in the storage 7;
5. Making an internal recovered temporary vocabulary (not shown), which is then used to make pseudo-codes by the formula Y=C+X;
6. Organizing the pseudo-codes in the form of compressed text 5 for storing in the storage 7.

Steps 1-6 are all done by the analyzer 4.
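Steps 1 to 3 above can be illustrated with a small Python sketch. The miniature vocabulary, its index values, and the function names are hypothetical, and the notion of "common" indexes is left out because the description does not define how they are selected.

```python
from collections import Counter

# Hypothetical miniature permanent vocabulary (symbol -> high bit index).
PERMANENT = {"the": 1, "shaft": 20481, "idler": 15002, "conveyor": 9077}


def to_indexes(symbols, permanent=PERMANENT):
    """Step 1: replace each source symbol by its permanent index."""
    return [permanent[s] for s in symbols]


def count_indexes(indexes):
    """Steps 2-3: count occurrences and keep the two groups apart."""
    counts = Counter(indexes)
    repeating = [i for i, n in counts.items() if n > 1]
    non_repeating = [i for i, n in counts.items() if n == 1]
    return repeating, non_repeating


source = ["the", "idler", "shaft", "the", "conveyor", "shaft"]
repeating, non_repeating = count_indexes(to_indexes(source))
```

In terms of the description, the repeating group would be a natural candidate for the root of tree storage (constant C) and the rest for the main storage (X), although the analyzer's actual decision rules are not given.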

The process of text decompression includes the following steps:

1. Making the recovered temporary vocabulary 8 by converting the compressed indexes belonging to the temporary vocabulary 6 into high bit corresponding indexes and/or original symbols belonging to the permanent vocabulary 2.
2. Recovering the pseudo-codes into indexes/symbols by the formula X=Y−C. The process of recovering the pseudo-codes into symbols of the uncompressed target text (characters) involves the recovered temporary vocabulary 8.

FIG. 4 illustrates a flow chart of text compression on the transmitter side and of text decompression on the receiver side.

Where the reference numerals denote: 1, source text in the form of characters; 2, permanent vocabulary; 3, source text in the form of indexes; 4, analyzer; 5, transmission line; 9, pseudo-codes; 10, compressed temporary vocabulary; 11, storage of compressed target text; 12, permanent vocabulary; 13, recovered temporary vocabulary; 14, uncompressed target text in the form of characters.

In this embodiment the process of text compression includes the following steps:

1. Converting the source text in the form of characters 1 into corresponding indexes 3 through the permanent vocabulary 2.
2. Counting repeating, non-repeating, and common indexes.
3. Temporarily storing repeating, non-repeating, and common indexes.
4. Compressing the high bit corresponding indexes belonging to the temporary vocabulary (see FIG. 5) and then organizing these compressed indexes 10 for storing in the storage 11.
5. Making an internal recovered temporary vocabulary (not shown), which is then used to make pseudo-codes 9.
6. Organizing the pseudo-codes in the form of compressed text for storing in the storage 11.

Steps 1-6 are all done by the analyzer 4.

In the present system the internal recovered temporary vocabulary on the transmitter side and the recovered temporary vocabulary on the receiver side should be identical.

The process of text decompression on the receiver side includes the following steps:

1. Making the recovered temporary vocabulary 13 by converting the compressed indexes belonging to the temporary vocabulary 10 into high bit corresponding indexes and/or original symbols belonging to the permanent vocabulary 12.
2. Recovering the pseudo-codes 9 into indexes/symbols by the formula X=Y−C. The process of recovering the pseudo-codes into symbols of the uncompressed target text 14 involves the recovered temporary vocabulary 13.

The algorithm of loading the temporary vocabulary may include commands such as “Code”, “Next”, and “Stop”. The command “Code” carries the variable length, in bits, of the loading index. For example, for a vocabulary that includes 8,000,000 symbols the maximum index length is 23 bits. The command “Code” may also carry data for searching text, for example with the Google search engine. The loading index length depends on the density of the matched indexes in the blocks, not on the size of the permanent vocabulary. This ordering of the loading index lengths is illustrated by the example in FIG. 5. The command “Next” is used to change the loading index length. The command “Stop” indicates the end of the process of loading indexes into any section of the temporary vocabulary.
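The 23 bit figure and the role of the three commands can be illustrated with a Python sketch. The stream layout is an assumption made for this example only: commands are modelled as tuples and indexes as plain integers, whereas the description does not define the actual encoding. The function names are hypothetical.

```python
def index_bits(vocabulary_size):
    """Bits needed to address every symbol of a vocabulary."""
    return max(1, (vocabulary_size - 1).bit_length())


def load_indexes(stream):
    """Interpret a hypothetical loading stream.

    ("Code", n) sets the index length in bits, ("Next", n) changes it,
    ("Stop",) ends loading; integers are indexes at the current length.
    """
    width, loaded = None, []
    for item in stream:
        if isinstance(item, tuple):
            if item[0] in ("Code", "Next"):
                width = item[1]
            elif item[0] == "Stop":
                break
            else:
                raise ValueError(f"unknown command {item[0]!r}")
        else:
            if width is None:
                raise ValueError("index received before a Code command")
            if item >= 1 << width:
                raise ValueError(f"{item} does not fit in {width} bits")
            loaded.append((item, width))
    return loaded


max_bits = index_bits(8_000_000)
loaded = load_indexes([("Code", 4), 3, 12, ("Next", 8), 200, ("Stop",), 99])
```

Here max_bits is 23 because 2^22 = 4,194,304 is smaller than 8,000,000 while 2^23 = 8,388,608 is not.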

In addition, the algorithm of loading the temporary vocabulary includes steps of building the temporary vocabulary without sending loading indexes. These steps are performed by the spider program. The algorithm of loading the temporary vocabulary also includes the step of making a hybrid symbol (e.g. a phrase) from symbols contained in the temporary vocabulary. This step is performed by a communication program. FIG. 1 illustrates some hybrid symbols made by the communication program, such as “idler shaft”, “endless conveyor”, “armored face”, and “scraper chain conveyor”, or the phrase “The shaft supporting assembly according to claim” taken from patent 6,227,354 (where it repeats 9 times).

The variable length of the indexes, the order in which the commands are used, and the spider and communication programs define the analyzer.

FIG. 5 illustrates the process of loading the temporary vocabulary.

Where the reference numerals denote: 1 and 3, permanent vocabularies; 2, compressed text storage, which includes the indexes loaded into the temporary vocabulary, the commands, and the pseudo-codes; 4, recovered temporary vocabulary.

To better illustrate the process of loading the temporary vocabulary, consider the loading of the word “friend”. This process includes the following steps:

1. Finding the word “friend” in the permanent vocabulary 1 and converting the matched word “friend” into its high bit corresponding index (15867).
2. Lowering the high bit corresponding index 15867 (14 bits) into the low bit index 567 (10 bits).
3. Storing or transmitting the low bit index 567.
4. Decompressing the low bit index 567 into the high bit index 15867 and the corresponding word “friend”.
5. Loading the high bit index 15867 and/or the corresponding word “friend” into the temporary vocabulary 4.
6. Renumbering the loaded index 15867 under the new index organization as 134. The new index number is a constant C and/or a current X, which is then used in the formula for creating the pseudo-code.
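The six steps can be traced in a Python sketch. The description gives the values 15867, 567, and 134 but not the rule that maps one to the next, so the sketch assumes, purely for illustration, that the low bit index is an offset from a block base of 15300 and that renumbering assigns the next free position in the temporary vocabulary. All names are hypothetical.

```python
PERMANENT = {"friend": 15867}               # value from the example
SYMBOL_OF = {v: k for k, v in PERMANENT.items()}
BLOCK_BASE = 15300                          # assumed so that 15867 -> 567
LOW_BITS = 10


def lower(high):
    """Step 2: turn a high bit index into a low bit offset."""
    low = high - BLOCK_BASE
    if not 0 <= low < (1 << LOW_BITS):
        raise ValueError("index lies outside the assumed block")
    return low


def restore(low):
    """Step 4: recover the high bit index and its symbol."""
    high = low + BLOCK_BASE
    return high, SYMBOL_OF[high]


def load(temporary, high, symbol):
    """Steps 5-6: store the entry and return its new, short index."""
    temporary.append((high, symbol))
    return len(temporary) - 1


high = PERMANENT["friend"]                  # step 1
low = lower(high)                           # steps 2-3: 567 is what is sent
restored_high, word = restore(low)
temporary = [None] * 134                    # stand-in for earlier entries
new_index = load(temporary, restored_high, word)
```

Under these assumptions new_index is 134, the value that would then serve as C or X in Y=C+X.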

In the present example the variable index lengths are 4 bits for the first and second blocks (13 indexes), 8 bits for the third block (250 indexes), and 10 bits for the last block (1020 indexes). The loading of symbols into the temporary vocabulary starts from either the low or the high bit length and continues until all symbols have been loaded into the temporary vocabulary.

According to the present method of text compression/decompression, the present system should be a synchronized system. This means that the apparatuses for compressing and decompressing text should use the same permanent and temporary vocabularies, commands, and algorithms.

One benefit of the present method of text compression is a high compression ratio and a high compression rate. Another benefit is that text compression and decompression can be done independently of each other. These benefits make it possible to use, for example, a low cost electronic reader for reading electronic books, or a high speed analyzer for fast compression/decompression of text on the Internet. Still another benefit is the ability to search a compressed text in the system without decompressing the full text. The searching process may, for example, include the steps of: converting a searched word into a high bit length index; converting the high bit length index into a low bit length compressed index; and searching for a match between this low bit compressed index and the low bit compressed indexes contained in the main storage of the temporary vocabulary. Still another benefit of the present method of text compression is the ability to compress large texts as well as small texts, e.g. one sentence, with a high compression ratio.
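The three search steps listed above can be sketched in Python. The sketch assumes that the low bit compressed index is an offset from a known block base and that the main storage is a simple list of such offsets; neither detail is specified in the description, and the function name and sample values other than 15867 and 567 are hypothetical.

```python
def search_compressed(word, permanent, block_base, main_storage):
    """Find a word in compressed storage without decompressing it."""
    high = permanent.get(word)          # step 1: word -> high bit index
    if high is None:
        return []                       # word unknown to the vocabulary
    low = high - block_base             # step 2: high -> low bit index
    # Step 3: compare against the stored low bit indexes only.
    return [pos for pos, stored in enumerate(main_storage) if stored == low]


permanent = {"friend": 15867, "shaft": 15400}
main_storage = [100, 567, 12, 567]
friend_hits = search_compressed("friend", permanent, 15300, main_storage)
shaft_hits = search_compressed("shaft", permanent, 15300, main_storage)
```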

CLAIMS

1. A method and system for compressing and decompressing text, which includes the following steps: creation of a permanent vocabulary; conversion of symbols into corresponding high bit length indexes; creation of a temporary vocabulary; creation of pseudo-codes; arrangement of the pseudo-codes for storage and transmission; and recovery of the pseudo-codes into uncompressed target text in the form of characters.

2. Creation of a permanent vocabulary of claim 1 is comprised of the following steps: symbols and their corresponding indexes are taken from thousands of books and dictionaries in advance and are used as a reference vocabulary in the future; the permanent vocabulary is split into various functional sections.

3. Conversion of symbols into high bit length corresponding indexes of claim 1 is comprised of matching symbols from the source text with those in the permanent vocabulary and converting these symbols into high bit length corresponding indexes.

4. Creation of a temporary vocabulary of claim 1 is comprised of the following steps: finding repeating and common symbols in the source text; converting these repeating and common symbols into high bit length corresponding indexes; compressing these high bit length indexes; storing or transmitting the compressed high bit length indexes; decompressing the stored or transmitted compressed high bit length indexes into high bit length indexes and corresponding symbols; loading the organized high bit length indexes and/or corresponding symbols into the temporary vocabulary; renumbering the loaded indexes; splitting the temporary vocabulary into two sections, a root of tree storage and a main storage.

5. A pseudo-code of claim 1 is created by the formula Y=C+X and recovered into uncompressed target text in the form of characters by the formula X=Y−C, and as said keeps the information of two indexes and their corresponding symbols, namely a repeating and common index/symbol (constant C) and a current index/symbol (X).

6. An analyzer, which is used for searching symbols in the text; for converting these symbols into high bit length corresponding indexes; for counting repeating, non-repeating, and common indexes; for temporarily storing repeating, non-repeating, and common indexes; for making a temporary vocabulary; for creating pseudo-codes; for organizing a compressed temporary vocabulary and pseudo-codes for storage and/or transmission; for reducing the number of transmitted indexes; and for servicing the spider and communication programs.